A Cost Model and Index Architecture for the Similarity Join
Authors
Abstract
The similarity join is an important database primitive that has been successfully applied to speed up data mining algorithms. In the similarity join, two point sets of a multidimensional vector space are combined such that the result contains all point pairs whose distance does not exceed a parameter ε. Due to its high practical relevance, many similarity join algorithms have been devised. In this paper, we propose an analytical cost model for the similarity join operation based on indexes. Our problem analysis reveals a serious optimization conflict between CPU time and I/O time: fine-grained index structures benefit CPU efficiency but degrade I/O performance. As a consequence of this observation, we propose a new index architecture and join algorithm that allow CPU time and I/O time to be optimized separately. Our solution uses large pages that are optimized for I/O processing. Each page accommodates a search structure that minimizes the computational effort. The experimental evaluation shows a substantial improvement over competitive techniques.
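To make the join predicate in the abstract concrete, the following is a minimal sketch of a nested-loop similarity join. It is only a baseline illustration of the definition, not the index-based algorithm the paper proposes. The function name `similarity_join`, the sample points, and the choice of Euclidean distance are assumptions made here; the abstract does not fix a metric.

```python
from math import dist  # Euclidean distance, available since Python 3.8


def similarity_join(r, s, eps):
    """Return all pairs (p, q), p in r and q in s, with distance <= eps.

    Naive nested-loop baseline: it compares every point of r with every
    point of s, so it costs len(r) * len(s) distance computations.
    Euclidean distance is an assumption of this sketch.
    """
    return [(p, q) for p in r for q in s if dist(p, q) <= eps]


# Hypothetical sample data in a two-dimensional vector space.
r = [(0.0, 0.0), (1.0, 1.0)]
s = [(0.1, 0.0), (2.0, 2.0), (1.0, 1.2)]

pairs = similarity_join(r, s, eps=0.25)
for p, q in pairs:
    print(p, q)
```

Every point pair is tested here, which is exactly the CPU cost that index structures try to avoid by pruning pairs of pages or partitions that cannot contain matches.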
Similar resources
A Novel Architecture for Detecting Phishing Webpages using Cost-based Feature Selection
Phishing is one of the luring techniques used to exploit personal information. A phishing webpage detection system (PWDS) extracts features to determine whether it is a phishing webpage or not. Selecting appropriate features improves the performance of PWDS. Performance criteria are detection accuracy and system response time. The major time consumed by PWDS arises from feature extraction that ...
A Fast Algorithm for High-Dimensional Similarity Joins
Many emerging data mining applications require a similarity join between points in a high-dimensional domain. We present a new algorithm that utilizes a new index structure, called the ε-kdB tree, for fast spatial similarity joins on high-dimensional points. This index structure reduces the number of neighboring leaf nodes that are considered for the join test, as well as the traversal cost of nd...
High Performance Data Mining Using the Nearest Neighbor Join
The similarity join has become an important database primitive to support similarity search and data mining. A similarity join combines two sets of complex objects such that the result contains all pairs of similar objects. Well-known are two types of the similarity join, the distance range join where the user defines a distance threshold for the join, and the closest point query or k-distance ...
Implementation of Low-Cost Architecture for Control an Active Front End Rectifier
In AC-DC power conversion, active front end rectifiers offer several advantages over diode rectifiers such as bidirectional power flow capability, sinusoidal input currents and controllable power factor. A digital finite control set model predictive controller based on fixed-point computations of an active front end rectifier with unity displacement of input voltage and current to improve dynam...
High-Dimensional Similarity Joins
Many emerging data mining applications require a similarity join between points in a high-dimensional domain. We present a new algorithm that utilizes a new index structure, called the ε-kdB tree, for fast spatial similarity joins on high-dimensional points. This index structure reduces the number of neighboring leaf nodes that are considered for the join test, as well as the traversal cost of f...